

Section: New Results

Decision Making

Optimizing Automated Service Discovery

Participant : Jörg Hoffmann.

Michael Stollberg (SAP Research, Germany) and Dieter Fensel (University of Innsbruck, Austria) are external collaborators.

We completed earlier work, done while all authors were employed at the University of Innsbruck, and published it in the International Journal of Semantic Computing [10] . In a nutshell, the work proposes to use first-order logic for annotating web services to achieve better precision and recall in service discovery; its core contribution is a caching technique that makes such discovery – which involves first-order logical reasoning – more effective, by storing known relationships between available services and possible discovery queries.

Overview of Semantic Web Service Technologies

Participant : Jörg Hoffmann.

Stijn Heymans (SemanticBits, USA), Annapaola Marconi (Fondazione Bruno Kessler, Trento, Italy), Joshua Phillips (SemanticBits, USA), and Ingo Weber (University of New South Wales, Sydney, Australia) are external collaborators.

We were invited to write a book chapter about the basic AI technologies underlying semantic Web service discovery and composition. The chapter has been published as part of the book entitled “Handbook of Service Description – USDL and its Methods” by Springer-Verlag [46] .

Analyzing Planning Domains to Predict Heuristic Function Quality

Participant : Jörg Hoffmann.

The heuristic search approach to planning (cf. the above) rises and falls with the quality of the heuristic estimates. The dominant method, especially in satisficing (non-optimal) planning, is to approximate a heuristic function called h + – this is used in almost every state-of-the-art satisficing planning system. In earlier work, Jörg Hoffmann showed that h + has some remarkable properties in many traditional planning benchmarks, in particular pertaining to the complete absence of local minima [62] . These proofs are hand-made, raising the question of whether such proofs can be carried out automatically by domain analysis techniques. The possible uses of such analysis are manifold, e.g., for automatic configuration of hybrid planners or for giving hints on how to improve the domain design. The question has been open since 2002. An earlier serious attempt by Jörg Hoffmann yielded disappointing results – his analysis method has exponential runtime and succeeds only in two extremely simple benchmark domains. In the present work, by contrast, we answer the question in the affirmative. We establish connections between certain easily testable syntactical structures, called “causal graphs”, and h + topology. This results in low-order polynomial time analysis methods, implemented in the TorchLight tool, cf. Section 5.2 . Of the 12 domains where Hoffmann proved the absence of local minima, TorchLight gives strong success guarantees in 8 domains. Empirically, its analysis exhibits strong performance in a further 2 of these domains, plus in 4 more domains where local minima may exist but are rare. We show that, in this way, TorchLight can distinguish Hoffmann's “easy” domains from the “hard” ones. By summarizing structural reasons for analysis failure, TorchLight also provides diagnostic output pinpointing potentially problematic aspects of the domain. A conference paper on this work was published at ICAPS 2011 [25] , and nominated for the best paper award there. A journal paper was published in the Journal of AI Research (JAIR) [9] .
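
To make the notion concrete, the following minimal Python sketch builds the causal graph of a planning task given in a finite-domain (SAS+-style) encoding: there is an arc from variable u to variable v whenever some operator mentions u in its precondition or effect and has an effect on v. The operator representation and variable names below are illustrative assumptions, not TorchLight's actual data structures.

    from collections import defaultdict

    def causal_graph(operators):
        # operators: list of (preconditions, effects) pairs, each a dict
        # mapping variable names to values (illustrative encoding).
        # Arc u -> v iff some operator mentions u (pre or eff) and affects v.
        arcs = defaultdict(set)
        for pre, eff in operators:
            mentioned = set(pre) | set(eff)
            for v in eff:
                for u in mentioned:
                    if u != v:
                        arcs[u].add(v)
        return arcs

    # Toy logistics-like example: driving changes the truck's position,
    # loading a package depends on both the truck's and the package's position.
    ops = [
        ({"truck": "A"}, {"truck": "B"}),                   # drive A -> B
        ({"truck": "B", "pkg": "B"}, {"pkg": "in-truck"}),  # load package
    ]
    print(dict(causal_graph(ops)))   # {'truck': {'pkg'}}: pkg depends on truck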

Relaxing Bisimulation for State Aggregation in the Computation of Lower Bounds

Participant : Jörg Hoffmann.

Raz Nissim (Ben-Gurion University, Beer-Sheva, Israel) and Malte Helmert (University of Freiburg, Germany) are external collaborators.

Like the previous line of work, this addresses planning as heuristic search, specifically the automatic generation of heuristic estimates. This is also the core question investigated in the BARQ project, see below. In preparation for this project, we are conducting this line of research, which explores some of the most basic ideas behind BARQ. The basic technique under consideration was developed in prior work outside INRIA [61] . The heuristic estimates are lower bounds generated from a quotient graph in which sets of states are aggregated into equivalence classes. A major difficulty in designing such classes is that there are exponentially many states. Despite this, our technique allows explicit selection of individual states to aggregate, via an incremental process that interleaves aggregation decisions with state space reconstruction steps. We have shown previously that, if the aggregation decisions are perfect, then this technique dominates the other known related techniques, and sometimes produces perfect estimates in polynomial time. But how to make these decisions? Little is known about this as yet. In the present work, we start from the notion of a “bisimulation”, a well-known criterion from model checking implying that the quotient system is behaviorally indistinguishable from the original system – in particular, the cost estimates based on a bisimulation are perfect. However, bisimulations are exponentially large even in trivial planning benchmarks. We observe that bisimulation can be relaxed without losing any information as far as the cost estimates are concerned. Namely, we can ignore the “content of the messages sent”, i.e., the state transition labels. Such relaxed bisimulations are often exponentially smaller than the original ones. We show to what extent this relaxation can also be applied within our incremental construction process. As a result, in several benchmarks we obtain perfect estimates in polynomial time, and we significantly increase the set of benchmark instances that can be solved with this approach. Indeed, the approach obtained a 2nd place in the optimal track of the 2011 International Planning Competition, and was part of the 1st-prize winning portfolio. A conference paper was published at IJCAI 2011 [28] , and a journal paper is under preparation for submission to the Journal of the ACM.
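
The following sketch illustrates the underlying idea on an explicit labeled transition system, using a standard partition-refinement loop; the use_labels flag switches to the label-ignoring relaxation described above. The representation and all names are illustrative assumptions, not the merge-and-shrink code actually used.

    from collections import defaultdict

    def coarsest_bisimulation(states, transitions, goal_states, use_labels=True):
        # Partition refinement on an explicit transition system.
        # transitions: list of (source, label, target) triples.
        # With use_labels=False, labels are ignored: blocks are split only on
        # *which* blocks a state can reach, not on how (the relaxation above).
        block_of = {s: int(s in goal_states) for s in states}   # initial split: goal vs. rest
        while True:
            succ = defaultdict(set)
            for s, label, t in transitions:
                succ[s].add((label, block_of[t]) if use_labels else block_of[t])
            signature = {s: (block_of[s], frozenset(succ[s])) for s in states}
            blocks = {sig: i for i, sig in enumerate(sorted(set(signature.values()), key=repr))}
            new_block_of = {s: blocks[signature[s]] for s in states}
            if len(set(new_block_of.values())) == len(set(block_of.values())):
                return new_block_of   # stable partition reached
            block_of = new_block_of

    # Two parallel paths to the goal with different labels: the strict
    # bisimulation keeps "1a" and "1b" apart, the label-ignoring one merges them.
    states = [0, "1a", "1b", 2]
    trans = [(0, "x", "1a"), (0, "y", "1b"), ("1a", "x", 2), ("1b", "y", 2)]
    print(coarsest_bisimulation(states, trans, {2}, use_labels=True))    # 4 blocks
    print(coarsest_bisimulation(states, trans, {2}, use_labels=False))   # 3 blocks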

Relaxing Bisimulation by Choosing Transition Subsets

Participants : Michael Katz, Jörg Hoffmann.

Malte Helmert (University of Freiburg, Germany) is an external collaborator.

This line of work builds on the previous one by designing new methods for relaxing bisimulations. The key idea is to apply the bisimulation property to only a subset of the transitions in the system under consideration. We showed that one can ignore large subsets of transitions without losing any information, i.e., while still guaranteeing to obtain a perfect heuristic. At the same time, such a relaxed bisimulation makes fewer distinctions and may thus be exponentially smaller. For practical purposes, we designed several approximate strategies that relax even more, obtaining smaller abstractions at the expense of information loss. The techniques are currently being evaluated empirically, and a paper submission is in preparation for ICAPS'12.

Improving h + by Taking Into Account (Some) Negative Effects

Participants : Emil Keyder, Jörg Hoffmann.

Patrik Haslum (NICTA, Australia) is an external collaborator.

Like the previous lines, this is on planning as heuristic search. As mentioned above in Section 6.1.3 , approximating the h + heuristic is the dominant approach to obtaining estimates in satisficing (non-optimal) planning. That notwithstanding, h + is obtained by ignoring all negative effects, which of course leads to very bad estimates in domains where these effects play a key role, for example in puzzle-like domains such as Rubik's Cube, where actions interfere intensively with each other. It has long (for almost 10 years) been an active research issue how to take at least some of the negative effects into account when computing h + . All attempts, however, remained rather ad hoc, e.g., counting the number of violated binary constraints (pairs of facts that cannot be true at the same time) within the relaxed plan underlying the estimate. In the present work, we provide for the first time a well-founded formal approach to the issue. As suggested in prior work [60] , we design a compiled planning task which introduces constructs allowing h + to correctly handle a subset C of fact conjunctions. Whereas this prior work requires a compilation exponential in |C| – and thus allows introducing only very few conjunctions – our compilation is linear in |C|. We proved that one can always choose C so that h + in the compiled task is a perfect heuristic. Of course, in general C might have to be exponentially large to achieve this. We designed practical methods that select C so that the overhead (the size of C) is kept at bay, while the quality of the heuristic is sufficiently improved to boost search performance. The techniques are currently being evaluated empirically, and a paper submission is in preparation for ICAPS'12.
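
For concreteness, here is a minimal sketch of the delete relaxation that h + is based on: the additive approximation h_add, computed by ignoring negative (delete) effects altogether. The STRIPS-style encoding and names are assumptions for illustration; h + itself is the optimal plan cost in this relaxation and is NP-hard to compute exactly, which is why approximations such as the one below are used in practice.

    def h_add(state, goal, actions):
        # actions: list of (preconditions, add_effects, cost) over sets of facts;
        # delete effects are simply absent, exactly as in the delete relaxation.
        INF = float("inf")
        cost = {f: 0 for f in state}            # cheapest known cost of each fact
        changed = True
        while changed:                          # fixed-point over reachable facts
            changed = False
            for pre, add, c in actions:
                if all(f in cost for f in pre):
                    pre_cost = sum(cost[f] for f in pre) + c
                    for f in add:
                        if cost.get(f, INF) > pre_cost:
                            cost[f] = pre_cost
                            changed = True
        return sum(cost.get(f, INF) for f in goal)

    # Example: reach 'at-B' and 'have-key' from 'at-A'.
    acts = [({"at-A"}, {"at-B"}, 1), ({"at-B"}, {"have-key"}, 1)]
    print(h_add({"at-A"}, {"at-B", "have-key"}, acts))   # 3 = 1 (at-B) + 2 (have-key)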

Accounting for Uncertainty in Penetration Testing

Participants : Olivier Buffet, Jörg Hoffmann.

Carlos Sarraute (Core Security Technologies) is an external collaborator.

Core Security Technologies is a U.S.-American/Argentinian company providing, amongst other things, tools for (semi-)automated security checking of computer networks against outside hacking attacks. To automate such checks, a module is needed that automatically generates potential attack paths. Since the application domain is highly dynamic, a module that allows the environment (the network and its configuration) to be specified declaratively is highly advantageous. For that reason, Core Security Technologies have been looking into using AI Planning techniques for this purpose. Following consulting by Jörg Hoffmann (see also Section 7.1.1 below), they are now using a variant of Jörg Hoffmann's FF planner (cf. Section 5.1 ) in their product. While that solution is satisfactory in many respects, it also has weaknesses. The main weakness is that it does not handle the incomplete knowledge in this domain – figuratively speaking, the attacker is assumed to have perfect information about the network. This results in high costs in terms of runtime and network traffic, due to extensive scanning activities prior to planning. We are currently working with Core Security's research department to overcome this issue by modeling and solving the attack planning problem as a POMDP instead. A workshop paper detailing the POMDP model has been published at SecArt'11 [29] . While such a model yields much higher quality attacks, solving an entire network as a POMDP is not feasible. We have designed a decomposition method that exploits network structure and approximations to overcome this problem, using the POMDP model only to find good-quality attacks on single machines and propagating the results through the network in an appropriate manner. A conference paper is in preparation for submission to ICAPS'12.

Searching for Information with MDPs

Participants : Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet.

In the context of Mauricio Araya's PhD, we are working on how MDPs – or related models – can be used to search for information. This has led to various research directions, which we describe now.

A POMDP Extension with Belief-dependent Rewards — A limitation of Partially Observable Markov Decision Processes (POMDPs) is that they only model problems where the performance criterion depends on the state-action history. This excludes for example scenarios where one wants to maximize the knowledge with respect to some random variables.

To overcome this limitation, we have proposed ρ-POMDPs, an extension of POMDPs in which the reward function depends on the belief state rather than on the state. In this framework, and under the hypothesis that the reward function is convex, we have proved that:

  • the value function itself is convex; and

  • if the reward function is α-Hölder, then the value function can be approximated arbitrarily well with a piecewise linear and convex function.

These results allow for adapting a number of solution algorithms relying on approximating the value function.
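
As a concrete illustration of a belief-dependent reward, the sketch below (assuming a discrete POMDP, with T and O as illustrative names for the transition and observation matrices) performs a Bayes belief update and evaluates the negative entropy of the belief – a convex reward that values information rather than states, which a ρ-POMDP can express but a plain POMDP cannot.

    import numpy as np

    def belief_update(belief, T, O, action, observation):
        # Bayes update of a discrete belief (vector over states) after taking
        # `action` and receiving `observation`.  T[a] is |S|x|S|, O[a] is |S|x|Z|
        # (illustrative model layout; any discrete POMDP would do).
        predicted = belief @ T[action]                    # prediction step
        unnormalized = predicted * O[action][:, observation]
        return unnormalized / unnormalized.sum()

    def rho_neg_entropy(belief, eps=1e-12):
        # Negative entropy of the belief: a convex belief-dependent reward.
        return float(np.sum(belief * np.log(belief + eps)))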

This theoretical work was first published at an international conference in December 2010, then in [36] , where it received a best paper award.

We are currently pursuing experimental work about the proposed algorithm.

Active Learning of MDP Models — Reinforcement Learning (RL) is about learning how to perform a task by trial and error, no model of the system to control being available. Model-based Bayesian RL (BRL) comprises all RL algorithms that maintain a belief (in the Bayesian sense) about the model of the system to control. In fact, this is a way to turn an RL problem into a POMDP (the unknown model becoming an unobservable part of the state), thus replacing the exploration-exploitation dilemma by the definition of a prior belief over possible models.

A particular BRL task we have been considering is to actively learn the dynamical model itself, i.e., to act so as to improve the knowledge about the transition function. In a way this means solving a ρ-POMDP, since the reward depends on a belief, not on a state. To that end, we have proposed several optimization criteria and derived the corresponding reward functions, making sure that their computational complexity allows for their use in a BRL algorithm. We have also proved that a non-optimistic BRL algorithm (exploit) could be used in this particular case.

This work, along with experiments, has been published in [36] and [35] (French version).
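
A minimal sketch of this setting, under the assumption of a discrete MDP with a Dirichlet posterior over each row of the transition function; the information_gain reward below is one simple belief-dependent criterion chosen for illustration, not necessarily among the exact criteria derived in the cited work.

    import numpy as np

    class DirichletModelBelief:
        # Belief over an MDP's transition model: one Dirichlet per (state, action).
        def __init__(self, n_states, n_actions, prior=1.0):
            self.counts = np.full((n_states, n_actions, n_states), prior)

        def update(self, s, a, s_next):
            self.counts[s, a, s_next] += 1.0       # posterior update after one transition

        def mean_model(self):
            return self.counts / self.counts.sum(axis=2, keepdims=True)

        def information_gain(self, s, a, s_next):
            # Entropy decrease of the mean transition distribution for (s, a)
            # if the transition (s, a, s_next) were observed: a simple
            # belief-dependent reward encouraging informative actions.
            def entropy(p):
                return -np.sum(p * np.log(p + 1e-12))
            before = self.counts[s, a] / self.counts[s, a].sum()
            after = self.counts[s, a].copy()
            after[s_next] += 1.0
            after = after / after.sum()
            return entropy(before) - entropy(after)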

PAC-BAMDP Algorithms — Exact or approximate solutions to Model-based Bayesian RL are impractical, so that a number of heuristic approaches have been considered, most of them relying on the principle of “optimism in the face of uncertainty”. Some of these algorithms have properties that guarantee the quality of their outcome, inspired by the PAC-learning (Probably Approximately Correct) framework. For example, some algorithms provably make in most cases the same decision as would be made if the true model were known (PAC-MDP property).

We have proposed a novel optimistic algorithm, bouh, that is

  • appealing in that it is (i) optimistic about the uncertainty in the model and (ii) deterministic (thus easier to study); and

  • provably PAC-BAMDP, i.e., makes in most cases the same decision as a perfect BRL algorithm would.

First results about this algorithm are currently under review.

Scheduling for Probabilistic Realtime Systems

Participant : Olivier Buffet.

Maxim Dorin, Luca Santinelli, Liliana Cucu-Grosjean (INRIA, TRIO team), and Rob Davies (U. of York) are external collaborators.

In this collaborative research work (mainly with the TRIO team), we look at the problem of scheduling periodic tasks on a single processor, in the case where each task's period is a (known) random variable. In this setting, some jobs will necessarily miss their deadlines, so one tries to satisfy criteria that depend on the number of deadline misses.

We have proposed three criteria: (1) satisfying pre-defined deadline miss ratios, (2) minimizing the worst deadline miss ratio, and (3) minimizing the average deadline miss ratio. For each criterion we propose an algorithm that computes a provably optimal fixed priority assignment, i.e., a solution obtained by assigning priorities to tasks and executing jobs by order of priority.
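
The sketch below illustrates the setting on a simplified discrete-time model (implicit deadlines equal to the sampled periods, unit-time execution slices, illustrative task encoding): it simulates preemptive fixed-priority scheduling and measures each task's deadline miss ratio, the quantity on which the three criteria above are defined.

    import random

    def simulate_fixed_priority(tasks, horizon, seed=0):
        # Preemptive fixed-priority scheduling on one processor, discrete time.
        # Each task dict has 'period' (zero-argument callable sampling the next
        # period), 'wcet' (execution demand) and 'priority' (lower = higher).
        # Returns one deadline miss ratio per task.
        random.seed(seed)
        next_release = [0] * len(tasks)
        jobs = []                               # active jobs: [task_id, remaining, deadline]
        released = [0] * len(tasks)
        missed = [0] * len(tasks)
        for t in range(horizon):
            for i, task in enumerate(tasks):
                if t == next_release[i]:        # release a new job of task i
                    period = task['period']()
                    jobs.append([i, task['wcet'], t + period])
                    released[i] += 1
                    next_release[i] = t + period
            for job in jobs[:]:                 # unfinished jobs past their deadline
                if t >= job[2]:
                    missed[job[0]] += 1
                    jobs.remove(job)
            if jobs:                            # run highest-priority job for one unit
                job = min(jobs, key=lambda j: tasks[j[0]]['priority'])
                job[1] -= 1
                if job[1] == 0:
                    jobs.remove(job)
        return [m / max(r, 1) for m, r in zip(missed, released)]

    # Example with two tasks whose periods are drawn from small known sets.
    tasks = [
        {'period': lambda: random.choice([4, 5]), 'wcet': 2, 'priority': 0},
        {'period': lambda: random.choice([6, 8]), 'wcet': 3, 'priority': 1},
    ]
    print(simulate_fixed_priority(tasks, horizon=10000))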

This work has been presented in [26] .

We also collaborate on other topics linked to real-time scheduling, such as (i) search algorithms for deterministic, but multiprocessor, problems [38] , and (ii) the problem of deciding which jobs to drop (ongoing work).

Adaptive Management with POMDPs

Participant : Olivier Buffet.

Iadine Chadès, Josie Carwardine, Tara G. Martin (CSIRO), Samuel Nicol (U. of Alaska Fairbanks) and Régis Sabbadin (INRA) are external collaborators.

In the field of conservation biology, adaptive management is about managing a system, e.g., performing actions so as to protect some endangered species, while learning how it behaves. This is a typical reinforcement learning task that could for example be addressed through BRL.

Here, we consider that a number of experts each provide us with one possible model, assuming that one of them is the true model. This allows making decisions by solving a mixed observability MDP (MOMDP), where the hidden part of the state corresponds to the model (in cases where all other variables are fully observable).
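
A minimal sketch of the hidden-model part, assuming each expert provides a transition tensor T[s, a, s'] (an illustrative representation): since the physical state is fully observed, the belief update only bears on the model index.

    import numpy as np

    def update_model_belief(belief, models, s, a, s_next):
        # belief: probability vector over the candidate models.
        # models: one transition tensor T[s, a, s'] per expert.
        # The observed transition (s, a, s_next) re-weights each model by the
        # likelihood it assigns to that transition.
        likelihoods = np.array([T[s, a, s_next] for T in models])
        posterior = belief * likelihoods
        return posterior / posterior.sum()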

We have conducted preliminary studies of this approach, using the scenario of the protection of the Gouldian finch, and focusing on the particular characteristics that could be exploited to more efficiently solve this problem. First results have been presented in [39] .

Information Gathering with Sensor Systems

Participant : Olivier Buffet.

Elodie Chanthery, Matthieu Godichaud (LAAS-CNRS) and Marc Contat (EADS) are external collaborators.

The DOPEC project was a DGA PEA (upstream studies project) on optimizing the use of sensor systems. In collaboration with EADS (project leader) and LAAS, we have worked on autonomous sequential decision-making problems. We were particularly interested, on the one hand, in multi-agent problems and, on the other hand, in taking uncertainties into account.

The overall architecture developed in the context of this project was presented at a national and an international conference [40] , [23] .

How do real rats solve non-stationary (PO)MDPs?

Participant : Alain Dutech.

Etienne Coutureau and Alain Marchand (Centre de Neurosciences Intégratives et Cognitives (CNIC), UMR 5228, Bordeaux) are external collaborators.

For a living entity, simultaneously using various ways of learning models or representations of its environment can be very useful for adapting to non-stationary environments in a Reinforcement Learning setting. In rats and monkeys, two different action control systems lie in specific regions of the prefrontal cortex. Neurobiologists and computer scientists find here a common ground to identify and model these systems and the selection mechanisms between them, a selection that could depend on uncertainty or error signals. Using real data collected on rats with or without prefrontal lesions, reinforcement learning models are used and evaluated in order to better understand this behavioral flexibility. MAIA is involved more particularly as a reinforcement learning expert, suggesting and building models of the various learning mechanisms. In particular, we have used an on-policy learning scheme (SARSA) to investigate how well simple or complex representations (with or without memory of the immediate past) can model the learning behavior of rats in instrumental contingency degradation tasks [7] .
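
For reference, a minimal tabular SARSA sketch of the kind used as a building block here; the environment interface (reset, step, actions) is an illustrative assumption, not the representation used with the rat data.

    import random
    from collections import defaultdict

    def sarsa(env, episodes, alpha=0.1, gamma=0.95, epsilon=0.1):
        # On-policy SARSA with an epsilon-greedy behavior policy.
        Q = defaultdict(float)

        def policy(state):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(episodes):
            state = env.reset()
            action = policy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = policy(next_state)
                target = reward + (0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q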

This work has led us to investigate in more detail the relations between the prefrontal cortex and the basal ganglia, and their respective roles when rats learn to solve non-stationary tasks. The research is conducted through the PEPII project IMAVO (see 8.2.7 ).

Developmental Reinforcement Learning

Participants : Alain Dutech, Olivier Buffet.

Luc Sarzyniec and Joël Legrand (M2R Student of UHP Nancy 1) are external collaborators.

The goal of this work is to investigate how reinforcement learning can benefit from a developmental approach in the field of robotics. Instead of having a robot directly learn a difficult task using appropriate but high-dimensional sensory and motor spaces, we have followed an incremental approach: the numbers of perception and action dimensions increase only when the performance of the learned behavior increases. At the core of the algorithm lies a neural approximator used to compute the value function of the current policy of the robot. When the perception or action space grows, neurons or networks, initialized from existing ones, are added to the control architecture.

Thus far, our research has focused on the approximation architecture used to evaluate the Q-function. In a simple robotic task, we investigated the use of Multi-Layer Perceptrons, either one approximator per possible action [41] or a single global approximator with as many outputs as actions (Master's thesis of Joël Legrand). Currently, a reservoir computing architecture is under study, as depicted in [16] .

Classification-based Policy Iteration with a Critic

Participant : Bruno Scherrer.

Victor Gabillon, Alessandro Lazaric and Mohammad Ghavamzadeh (from Sequel INRIA-Lille) are external collaborators.

We study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use the critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function that are strongly related to the length of the rollout trajectories. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present in [49] , [20] a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on LSTD and BRM methods. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
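
The core estimator can be sketched as follows: roll out the current policy for a bounded horizon and bootstrap the remaining return with the critic, so that shorter rollouts mean less variance while the critic keeps the bias under control. The env.simulate and critic interfaces are illustrative assumptions, not the actual DPI-Critic implementation.

    def truncated_rollout_value(env, state, action, policy, critic, horizon, gamma=0.99):
        # Estimate Q(state, action): follow `policy` for at most `horizon` steps,
        # then bootstrap with the critic's value estimate of the last state.
        total, discount = 0.0, 1.0
        s, a = state, action
        for _ in range(horizon):
            s_next, reward, done = env.simulate(s, a)
            total += discount * reward
            discount *= gamma
            if done:
                return total
            s, a = s_next, policy(s_next)
        return total + discount * critic(s)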

Linear Approximation of Value Functions

Participant : Bruno Scherrer.

Matthieu Geist (Supélec, Metz) is an external collaborator.

In the framework of Markov Decision Processes, we consider the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy.

In [30] , [42] , [51] , we describe a systematic approach for adapting least-squares on-policy learning algorithms from the literature (LSTD, LSPE, FPKF and GPTD/KTD) to off-policy learning with eligibility traces. This leads to two known algorithms, LSTD(λ) and LSPE(λ), and suggests new extensions of FPKF and GPTD/KTD. We describe their recursive implementation, discuss their convergence properties, and illustrate their behavior experimentally. Overall, our study suggests that the state-of-the-art LSTD(λ) remains the best least-squares algorithm.
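
As a point of reference, here is a minimal on-policy LSTD(λ) sketch with eligibility traces (the feature-map interface is an illustrative assumption); the off-policy variants studied in the cited papers additionally weight the traces with importance-sampling ratios.

    import numpy as np

    def lstd_lambda(trajectory, phi, n_features, gamma=0.99, lam=0.9):
        # Estimate w such that V(s) is approximately phi(s) . w, from a single
        # trajectory given as (state, reward, next_state) transitions.
        # Traces would be reset at episode boundaries.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        z = np.zeros(n_features)                     # eligibility trace
        for s, r, s_next in trajectory:
            f, f_next = phi(s), phi(s_next)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f_next)
            b += z * r
        # Small ridge term for numerical stability of the solve.
        return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)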

We also consider the task of feature selection. A promising approach consists in combining the Least-Squares Temporal Difference (LSTD) algorithm with ℓ1-regularization, which has proven effective in the supervised learning community. This has been done recently with the LARS-TD algorithm, which replaces the projection operator of LSTD with an ℓ1-penalized projection and solves the corresponding fixed-point problem. However, this approach is not guaranteed to be correct in the general off-policy setting. In [21] , we take a different route by adding an ℓ1-penalty term to the projected Bellman residual, which requires weaker assumptions while offering comparable performance. This comes at the cost of a higher computational complexity if only a part of the regularization path is computed. Nevertheless, our approach boils down to a supervised learning problem, which lets us envision easy extensions to other penalties.